Peter G. Raeth
Abstract



The purpose of this study was to investigate the feasibility of adapting
the coefficient κ introduced by Cohen (1960) and elaborated by Swaminathan,
Hambleton, and Algina (1974) to an internal consistency estimate for criterion-
referenced tests in single test administrations. The authors proposed the use
of κ as an internal consistency estimate by logically dividing criterion-
referenced tests into two subtests, each tapping mirrored behavioral levels and
content areas. Using a computer program developed by the second author, results
were tabulated on 93 seventh graders in an experimental study involving a series
of multiple-choice tests in the areas of ecology and geometry.
An Internal Consistency Estimate for Criterion-Referenced Tests



Gregg M. Strasler
and 

Peter Raeth 

University of South Carolina 

The purpose of this study was to investigate the feasibility of
adapting the coefficient kappa (κ) introduced by Cohen (1960) and
elaborated by Swaminathan, Hambleton, and Algina (1974) to an internal
consistency estimate for criterion-referenced tests in single test
administrations.

In an article on the reliability of criterion-referenced tests
(Swaminathan et al., 1974), the coefficient kappa (κ), an expression
for test-retest reliability of criterion-referenced tests, was defined as:

$$\kappa = \frac{P_o - P_e}{1 - P_e} \tag{1}$$

where $P_o$, the observed proportion of agreement, is given by

$$P_o = \sum_{i=1}^{k} P_{ii} \tag{2}$$

and $P_e$, the expected proportion of agreement, is given by

$$P_e = \sum_{i=1}^{k} P_{i\cdot} \, P_{\cdot i} \tag{3}$$

In these formulas, $P_{ii}$ represents the proportion of examinees placed in
the ith mastery state on two test administrations and $P_{i\cdot}$ and
$P_{\cdot i}$ represent the proportions of examinees assigned to the mastery
state i on the first and second test administrations, respectively.
Swaminathan et al. (1974) define κ as the proportion of agreement that exists
over and above that which can be expected by chance alone.
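
Equations (1) through (3) can be illustrated with a short computational
sketch. The Python below is not the authors' program; the function name
`kappa_from_table` and the example counts are hypothetical and chosen only
to show how the three quantities fit together for any number of mastery
states.

```python
from typing import Sequence


def kappa_from_table(table: Sequence[Sequence[float]]) -> float:
    """Coefficient kappa from a square agreement table.

    table[i][j] holds the count (or proportion) of examinees placed in
    mastery state i on the first administration and state j on the second.
    """
    n_states = len(table)
    total = sum(sum(row) for row in table)

    # Equation (2): observed agreement is the sum of the diagonal proportions.
    p_o = sum(table[i][i] for i in range(n_states)) / total

    # Marginal proportions for the first (rows) and second (columns)
    # administration.
    row_props = [sum(table[i]) / total for i in range(n_states)]
    col_props = [sum(table[i][j] for i in range(n_states)) / total
                 for j in range(n_states)]

    # Equation (3): agreement expected by chance from the marginals alone.
    p_e = sum(r * c for r, c in zip(row_props, col_props))

    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")

    # Equation (1).
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical counts for 50 examinees (rows: first administration,
# columns: second administration; state 0 = master, state 1 = nonmaster).
example_counts = [[20, 5],
                  [10, 15]]
print(round(kappa_from_table(example_counts), 3))
```

For these made-up counts the observed agreement is .70 and the chance
agreement is .50, so the sketch prints 0.4.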

Swaminathan et al. (1974) define the reliability of a criterion-
referenced test as ". . . the measure of agreement between the decisions made
in repeated test administrations" (p. 264). They further elaborate the need
for determining reliability estimates based on "subtest scores" (vis-a-vis
objectives) rather than total scores. Although the present writers agree
with this conception of test-retest reliability, we propose adapting κ as
an "internal consistency estimate" for determining the consistency of
decision making (i.e., classification of "master" vs. "nonmaster") within
a single administration of a criterion-referenced test.^1



^1 Some persons who are knowledgeable in measurement would disagree with the
authors' usage of the term "internal consistency estimate." In the present
text, internal consistency estimates include reliability coefficients
obtained from single test administrations which are not dependent upon test
taking speed.
Methodology



As stated by Glaser and Nitko (1971), a criterion-referenced test
is defined as ". . . one that is deliberately constructed to yield measure-
ments that are directly interpretable in terms of specified performance
standards" (p. 653). In essence, Glaser and Nitko contend that criterion-
referenced tests are designed to provide specific information about an
individual's performance within a domain of instructionally relevant tasks.

Using this conception of criterion-referenced tests, the first author
constructed and administered a series of 20-item, multiple-choice tests
(including summative pretests and posttests, and "learning exercises,"
i.e., formative tests) in a research study involving 93 students in the areas
of ecology and geometry. In an effort to establish high content validity,
each test was checked by "content experts" (seventh grade teachers) for
topic validity as well as process validity (Cureton, 1968). After modifi-
cations were made, each test was considered to be a representative sample
of both the topics and the cognitive processes of the unit of instruction
it represented. Each 20-item test contained six knowledge, seven compre-
hension, and seven application and analysis items as defined by Bloom's
taxonomy (Bloom, 1956). Examples of items categorized by behavior levels
are as follows:

(1) Knowledge (concept: food pyramid)

    A picture showing that, in an ecosystem, the consumers are fewer
    than the producers, and the producers fewer than the decomposers
    is called a:

    a. food pyramid
    b. food web
    c. food chain
    d. niche

(2) Comprehension (concept: food pyramid)

    If we show that the number of herons is smaller than the number
    of frogs which is smaller in number than the number of crickets,
    etc., we are showing a picture of a:

    a. niche
    b. food chain
    c. food web
    d. food pyramid

(3) Application (concept: food pyramid)

    Choose the statement in which "food pyramid" is used correctly:

    a. The eagles were at the top of the food pyramid.
    b. Within the food pyramid, green plants grow.
    c. The sides of the food pyramid represent the consumers.
    d. Algae are usually the "top dogs" in the food pyramid.

(4) Analysis (concepts: food pyramid, food web)

    A food pyramid is different than a food web in that the food pyramid

    a. is a better indicator of what goes on in an ecosystem
    b. describes the numbers of individuals in each population
    c. includes producers, consumers, and decomposers
    d. shows who eats whom in an ecosystem.

In order to adapt the concept of kappa to measure the internal con-
sistency of a criterion-referenced test, each of the 20-item tests was
divided into two 10-item "subtests." Items within subtests were matched
with respect to behavior levels (e.g., knowledge, comprehension, and applica-
tion and analysis) and content areas. Therefore, the two subtests within each
test were approximately the same with respect to difficulty and content
covered. It would follow that a student who did well on one subtest would
do equally well on the other subtest (and vice versa) if the test had high
internal reliability. If "mastery" is defined as 80% correct, then a
master on the first subtest (8, 9, or 10 items correct) should also master
the second subtest if the test is internally consistent.
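
The split described above can be sketched in a few lines of Python. The
item layout is an assumption made only for illustration: the paper does not
say which items were paired, so the sketch places even-numbered positions in
one subtest and odd-numbered positions in the other. The names
`subtest_scores`, `is_master`, `HALF_A`, and `HALF_B` are hypothetical.

```python
from typing import Sequence, Tuple

# Assumed layout: matched items sit next to each other, so even positions
# form one 10-item subtest and odd positions form its mirror.
HALF_A = list(range(0, 20, 2))
HALF_B = list(range(1, 20, 2))


def subtest_scores(responses: Sequence[int],
                   half_a: Sequence[int] = HALF_A,
                   half_b: Sequence[int] = HALF_B) -> Tuple[int, int]:
    """Number of items answered correctly (1 = correct) on each subtest."""
    return (sum(responses[i] for i in half_a),
            sum(responses[i] for i in half_b))


def is_master(score: int, n_items: int, cutoff_percent: int = 80) -> bool:
    """True when the score meets the criterion; integer arithmetic avoids
    floating-point surprises at the boundary."""
    return score * 100 >= cutoff_percent * n_items


# One hypothetical examinee: 17 of 20 correct, with all three errors
# falling in the second subtest.
student = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
           1, 1, 1, 1, 1, 0, 1, 0, 1, 1]
score_a, score_b = subtest_scores(student)
print(score_a, is_master(score_a, 10))   # 10 True
print(score_b, is_master(score_b, 10))   # 7 False
```

This examinee reaches 85% on the total test yet is a master on only one
subtest, which is the kind of inconsistency the adapted coefficient is
meant to register.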

In adapting kappa as an internal consistency estimate, $P_{ii}$ in Equation
(2) would represent the proportion of examinees placed in the ith mastery
state on two "matched" subtests of a criterion-referenced test. Likewise,
$P_{i\cdot}$ and $P_{\cdot i}$ in Equation (3) would represent the proportions
of examinees assigned to the mastery state i on the two respective subtests.
In essence, kappa may now be interpreted in Equation (1) as the agreement of
classification ("mastery" vs. "nonmastery") between subtests after taking
into consideration the correction for classifications occurring by chance.
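
As a rough illustration of this two-state adaptation, the sketch below
takes each examinee's mastery decision on the two subtests and returns the
counts of consistent masters and nonmasters together with the observed
agreement and kappa. It is not the authors' computer program; the function
name `subtest_agreement`, the dictionary keys, and the ten example decisions
are all hypothetical.

```python
from typing import Dict, Optional, Sequence


def subtest_agreement(first: Sequence[bool],
                      second: Sequence[bool]) -> Dict[str, Optional[float]]:
    """Agreement between mastery decisions on two matched subtests.

    first[n] / second[n] are True when examinee n met the criterion on the
    first / second subtest.
    """
    n = len(first)
    pairs = list(zip(first, second))

    # Examinees classified the same way on both subtests.
    n_masters = sum(1 for a, b in pairs if a and b)
    n_nonmasters = sum(1 for a, b in pairs if not a and not b)
    p_o = (n_masters + n_nonmasters) / n

    # Marginal proportions of masters on each subtest.
    p_first = sum(1 for a in first if a) / n
    p_second = sum(1 for b in second if b) / n

    # Chance agreement for the two states (master, nonmaster).
    p_e = p_first * p_second + (1 - p_first) * (1 - p_second)

    # Kappa is undefined when chance agreement is already 1.
    kappa = None if p_e == 1.0 else (p_o - p_e) / (1 - p_e)
    return {"n_masters": n_masters, "n_nonmasters": n_nonmasters,
            "p_o": p_o, "p_e": p_e, "kappa": kappa}


# Ten hypothetical examinees.
first = [True, True, True] + [False] * 7
second = [True, True, False, True] + [False] * 6
result = subtest_agreement(first, second)
print(result["n_masters"], result["n_nonmasters"])
print(round(result["p_o"], 3), round(result["kappa"], 3))
```

Here two examinees are masters on both subtests and six are nonmasters on
both, giving an observed agreement of .80 against a chance agreement of .58.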

All of the tests were administered to a total of 93 seventh-grade stu-
dents during twenty-three class days of instruction in the content areas of
ecology and geometry. Of the total number of students, 47 students were
assigned to a "learning for mastery" (Bloom, 1968) instructional mode and
the remaining 46 students served as a control. Although both instructional
modes were based on the same objectives and content for each unit of instruc-
tion, the learning for mastery students received immediate feedback and
corrective procedures for each "learning exercise" (formative test). The
learning for mastery students also received additional time in the class-
room to correct their mistakes. The control students had neither of the
above provisions.



Results

Through a computer program developed by the second author, the con-
ception of κ as an internal consistency estimate was used to analyze the
data. Table 1 summarizes some of the output generated by the computer pro-
gram using κ as an internal consistency estimate.



Insert Table 1 about here. 
The number or percentages of "masters" or "nonmasters" in the table
refer to students who met or did not meet the prespecified criterion (in
this case, 70% and 80% correct), respectively, in both subtests of each
type of test presented. Therefore, a "master" has the added meaning of
achieving "consistency of mastery" on logically derived subtests as
well as meeting a prespecified criterion (70% correct, 80% correct, etc.)
on the total test score.

As noted by Swaminathan et al. (1974), the range of κ has a lower
limit of close to -1 extending to +1 as an upper limit. A negative value
of κ is, however, indicative of a highly suspect inconsistency in the
decision making process. In fact, Millman (1974) points out that, if κ < 0,
the agreement rate would be defined as less than expected by chance. There-
fore, as Huynh (1976) suggests, negative values of κ should be equated to 0,
whereas increasing increments of κ in a positive direction should indicate
increasing consistency in the decision making process and, hence, increasing
reliability. A value of κ approximating 0 may be interpreted as what would
be expected "by chance" alone. A value where κ approximates 0 might also
be interpretable in a pretest where no prior instruction has occurred.

It is noteworthy to observe the values of κ in Table 1. When a
criterion of 80% correct is specified, all of the "learning exercises" (with
the exception of the second test in the geometry unit) take on positive
values of κ after instruction has been received. In the summative pretests
(prior to instruction) values of κ approximate 0, whereas the values of κ
for the summative posttests in the ecology and geometry units are 0.438 and
0.580, respectively. The overall pattern holds true when the criterion
for mastery is set for 70% correct.

Table 2 displays κ as a function of criterion for mastery scores for
the total group of 93 students as well as for the "learning for mastery"
students (N = 47) and the control group (N = 46).



Insert Table 2 about here. 



As the criterion for mastery increases, κ increases to a limit and then
decreases. In general terms, κ appears to be "maximal" at the 60% to 80%
criterions for mastery scores. These results concur somewhat with Huynh's
findings (Huynh, 1976) in which κ was maximal at the 65% to 75% criterions
for mastery scores for three achievement tests. Huynh (1976) explains
this occurrence partly by stating that $P_e$ approximates 1 when the cutoff
(criterion for mastery) is too small or too large. Therefore, there is
not much room for the "improvement" of the consistency of decisions beyond
the chance level.
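
The dependence of kappa on the cutoff can be explored with a small sweep
such as the one sketched below. The function names, the default cutoffs,
and the eight pairs of subtest scores are hypothetical and are not taken
from the study; the sketch only shows the mechanics, including the case in
which chance agreement equals 1 and kappa cannot be computed.

```python
from typing import Dict, Optional, Sequence


def kappa_at_cutoff(scores_a: Sequence[int], scores_b: Sequence[int],
                    items_per_subtest: int,
                    cutoff_percent: int) -> Optional[float]:
    """Kappa between mastery decisions on two subtests at one cutoff."""
    n = len(scores_a)
    needed = cutoff_percent * items_per_subtest  # compared with score * 100
    master_a = [s * 100 >= needed for s in scores_a]
    master_b = [s * 100 >= needed for s in scores_b]

    p_o = sum(1 for a, b in zip(master_a, master_b) if a == b) / n
    p_a = sum(master_a) / n
    p_b = sum(master_b) / n
    p_e = p_a * p_b + (1 - p_a) * (1 - p_b)

    # When everyone falls in the same state on both subtests, chance
    # agreement is 1 and there is no room left for kappa to be defined.
    if p_e == 1.0:
        return None
    return (p_o - p_e) / (1 - p_e)


def kappa_by_cutoff(scores_a: Sequence[int], scores_b: Sequence[int],
                    items_per_subtest: int = 10,
                    cutoffs: Sequence[int] = (60, 70, 80, 90, 100)
                    ) -> Dict[int, Optional[float]]:
    """Kappa for each criterion for mastery in `cutoffs`."""
    return {c: kappa_at_cutoff(scores_a, scores_b, items_per_subtest, c)
            for c in cutoffs}


# Hypothetical subtest scores (out of 10) for eight examinees.
scores_a = [9, 9, 8, 7, 6, 5, 4, 3]
scores_b = [9, 8, 7, 8, 6, 4, 5, 2]
for cutoff, value in kappa_by_cutoff(scores_a, scores_b).items():
    print(cutoff, "undefined" if value is None else round(value, 3))
```

With these made-up scores nobody reaches 100% on either subtest, so the
last cutoff is reported as undefined rather than as a number.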

One other point should be noted from the results obtained in Table 2.
The κ values for the learning for mastery students (experimental group)
appear to be consistently higher than the corresponding κ values for
the control group with the exception of the summative pretests in which
κ approximates 0 for both groups.

In essence, there appears to be a positive relationship between the
reliability (i.e., κ conceived as an internal consistency estimate) of a
logically derived criterion-referenced test and the amount (or quality)
of instruction received. If a test is well defined in terms of content
covered and behavior levels required, the internal consistency between
logically developed subtests may be dependent somewhat on the "meaning-
fulness" of the instruction received.

Table 3 depicts the means and standard deviations as well as tradi-
tional reliability estimates for each of the 10 criterion-referenced
tests involved.



Insert Table 3 about here. 



The traditional reliability estimates observed were the Kuder-Richardson
Formula 20 (KR20) and the Spearman-Brown prophecy formula. The "split-
halves" of the Spearman-Brown were identical to the "subtests" used in
measuring kappa. As observed in Table 3, there appears to be a high positive
relationship between these two traditional estimates of reliability.
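
For readers who want to reproduce the two traditional estimates, the
following sketch computes KR20 from a matrix of 0/1 item scores and the
Spearman-Brown stepped-up correlation from two half-test scores. It assumes
the population (divide-by-N) variance for the total scores, a choice the
paper does not specify, and the function names and the five-examinee
example are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance
from typing import Sequence


def kr20(item_matrix: Sequence[Sequence[int]]) -> float:
    """Kuder-Richardson Formula 20 for rows of 0/1 item scores."""
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    var_total = pvariance(totals)  # assumption: population variance

    # Sum of p * q over items, where p is the proportion answering correctly.
    sum_pq = 0.0
    for j in range(n_items):
        p = mean(row[j] for row in item_matrix)
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(half_a: Sequence[float], half_b: Sequence[float]) -> float:
    """Split-half reliability stepped up to full test length."""
    r_half = pearson_r(half_a, half_b)
    return 2 * r_half / (1 + r_half)


# Five hypothetical examinees on a four-item test.
items = [[1, 1, 1, 1],
         [1, 1, 1, 0],
         [1, 1, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 0]]
half_a = [row[0] + row[2] for row in items]  # items 1 and 3
half_b = [row[1] + row[3] for row in items]  # items 2 and 4
print(round(kr20(items), 3), round(spearman_brown(half_a, half_b), 3))
```

For this toy matrix the sketch gives a KR20 of .80 and a stepped-up
split-half value of .88; a negative half-test correlation would produce a
negative Spearman-Brown value, as seen in some control-group entries of
Table 3.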

Whether or not classical test theory can be applied to a criterion-
referenced framework has been a debated issue in recent years. One camp
advocates that the concept of variability in test scores is irrelevant
with criterion-referenced tests (e.g., Popham and Husek, 1969; Millman and
Popham, 1974). The other camp emphasizes that variability has been observed
in criterion-referenced testing and is an important concept to be considered
(e.g., Woodson, 1974; Haladyna, 1974). The question remains unresolved as
to whether classical test measurement (e.g., KR20, Spearman-Brown, etc.) is
appropriate for evaluating criterion-referenced tests.

As noted in Table 3, test scores were moderately heterogeneous in the
experimental group (learning for mastery students). From a theoretical
standpoint, Bloom (1976) indicates that, in such a learning for mastery set-
ting, scores would tend to be more homogeneous. In a mastery learning in-
structional model, test scores should become higher and less variant in
nature.

In an effort to observe what effects would occur in more homogeneous
settings, data was simulated (N = 100 cases; test length = 20 items) to
approximate various "stages" of negatively skewed scoring distributions.^2
Kappa coefficients, KR20s, and the Spearman-Brown prophecy formula were
compared on the following four simulated data sets:

(1) X̄ = 10, SD = 4

(2) X̄ = 13, SD = 3

(3) X̄ = 16, SD = 2

(4) X̄ = 19, SD = 1

Data set (1) represents an approximate normal distribution of scores
whereas data sets (2), (3), and (4) are negatively skewed with arithmetically
increasing means (X̄'s) and decreasing standard deviations (SD's).

Table 4 denotes the values of kappa (criterion for mastery = 80%), KR20,
and Spearman-Brown prophecy formula for the four simulated data sets. When
certain properties affecting reliability were held constant (i.e., N = 100
cases; test length = 20 items), all three internal consistency estimates
(κ, KR20, and Split-Half) were influenced by the decreasing variability in
test scores. As variability decreased, so did the internal consistency
estimates. In fact, kappa (criterion for mastery = 80%) approximates the
value of 0 in data sets (3) and (4). Like the more traditional estimates of
internal consistency (KR20, Spearman-Brown), interpretation of kappa becomes
suspect when variability in test scores decreases.

Conclusion

The coefficient κ appears to be well suited for use as an internal
consistency estimate for criterion-referenced tests in single test adminis-
trations. The use of κ in test-retest reliability may be too cumbersome a
process for teacher-made criterion-referenced tests. There appears to be a
need for an internal reliability estimate to indicate the appropriateness of
"master" versus "nonmaster" in a single test administration. With the
advent of increasing sophistication in criterion-referenced test develop-
ment, there also appears the need for equating "logical" split-half
reliability by using both behavior levels and content areas as criteria
for forming mirrored subtests. Therefore, a "master" (or "nonmaster") has
the added meaning of achieving (or failing to meet) a prespecified
criterion as well as achieving "consistency of mastery" on two logically
derived subtests of a criterion-referenced test.



^2 In order to preserve the definition of kappa espoused by the authors,
the simulated data sets were based on actual item responses made by the
93 students on the third learning exercise in ecosystems. Thus, if two
test scores of "18" were required for a simulated data set, two scores of
"18" were randomly selected from a pool of students who actually scored
"18" on the learning exercise. In the case where there were fewer than
three students who actually attained a particular desired score, the next
highest score(s) was (were) modified by adding one (or more) randomly
selected item(s). Therefore, each of the test scores represented in the
simulated data was randomly selected from a pool of at least three or
more actual test scores.

Unlike other reliability estimates, kappa (κ) is concerned with the
reliability of classifications, not with the reliability of scores. In a
criterion-referenced testing atmosphere, there is a need for consistency in
decision-making (e.g., "master" vs. "nonmaster" classification) whether or
not variability (in test scores) is present. However, as Swaminathan et al.
(1974) have pointed out, kappa is situation specific, and therefore, addi-
tional information such as criterion for mastery score, test score
variability, test length, etc. should be reported along with this index for
interpretation.
TABLE 1

Output of Overall Statistics of Kappa Based on 93 Students

                                     Criterion for Mastery = 70%          Criterion for Mastery = 80%
Test Type                           #M   %M  #NM  %NM     Po      κ      #M   %M  #NM  %NM     Po      κ

Learning Exercises (Ecosystems)
  1                                 --   --   70  75%   .817   .306       1   1%   --  88%   .892   .111
  2                                 13  14%   --  62%   .763   .383      --   --   75  81%   .839   .195
  3                                 12   --   48  52%   .645   .241       8   9%   64  69%   .774   .335
Summative Pretest (Ecosystems)       0   --   88   --   .946  -.018       0   --   92  99%   .989   .000
Summative Posttest (Ecosystems)     27  29%   43  46%   .753   .493      --  15%   59  63%   .785   .438

Learning Exercises (Geometry)
  1                                 --   --   65   --   .795   .391       3   3%   77  83%   .860   .250
  2                                  1   1%   80  86%   .871   .079       0   0%   91   --   .978  -.011
  3                                 --   --   69  74%   .806   .285       2   2%   82   --   .903   .259
Summative Pretest (Geometry)         1   1%   86  93%   .935   .236       0   --   --  98%   .978   .000
Summative Posttest (Geometry)       12   --   --   --   .796   .428      --  10%   74  80%   .892   .580

NOTE:

#M  = number of masters in both subtests
%M  = percentage of masters in both subtests
#NM = number of nonmasters in both subtests
%NM = percentage of nonmasters in both subtests
Po  = the observed proportion of agreement of masters and nonmasters in both subtests
κ   = proportion of agreement that exists over and above that which can be expected
      by chance alone
--  = entry illegible in the source copy



TABLE 2

Kappa As A Function of Criterion For Mastery Scores

                                         Total (N=93)                    Experimental (N=47)                   Control (N=46)
                                     Criterion for Mastery               Criterion for Mastery               Criterion for Mastery
Test Type                         60%    70%    80%    90%   100%     60%    70%    80%    90%   100%     60%    70%    80%    90%   100%

Learning Exercises (Ecosystems)
  1                              .083   .306   .111   .000   .000    .085   .290   .109   .000   .000    .030   .238  -.022   NONE   NONE
  2                              .433   .383   .195   .000   NONE    .402   .382   .178   .000   NONE    .370   .107   .000   NONE   NONE
  3                                --   .241   .335   .069   .000    .368   .222   .287   .075   .000    .280   .143   .33?   .000   .000
Summative Pre (Ecosystems)      -.011  -.018   .000   .000     --   -.079  -.029   .000   .000   NONE   -.038   .000     --   NONE   NONE
Summative Post (Ecosystems)      .511   .493   .438   .294     --    .599   .421   .386   .301   .000    .249   .388   .340  -.022   NONE

Learning Exercises (Geometry)
  1                              .402   .391   .250  -.022   NONE    .439   .424   .225  -.044   NONE    .258  -.038   .000   NONE   NONE
  2                              .139   .079  -.011   NONE   NONE    .183   .141  -.022   NONE   NONE   -.073  -.036   NONE   NONE   NONE
  3                              .256   .285   .259   .258   .000    .276   .217   .332   .368   .000    .033   .287  -.030  -.022     --
Summative Pre (Geometry)         .179   .235   .000   .000   NONE    .208   .221   .000   .000   NONE    .000   NONE   NONE   NONE   NONE
Summative Post (Geometry)        .408   .428   .580   .239   .000    .395   .309   .533   .187   .000    .045  -.030   .000   NONE   NONE

NOTE:

Experimental = learning for mastery students
Criterion for Mastery = percent of items correct in both subtests
NONE = the case where no one has mastered either subtest
-- or ? = entry or digit illegible in the source copy



TABLE 3

Means, Standard Deviations, Reliability Estimates

                                         Total (N=93)                    Experimental (N=47)                   Control (N=46)
Test Type                        Mean (SD)      KR-20  Split-Half    Mean (SD)      KR-20  Split-Half    Mean (SD)      KR-20  Split-Half

Learning Exercises (Ecosystems)
  1                               9.02 (3.20)    .603     .?90        9.49 (3.67)    .697     .621        8.54 (2.61)    .413     .50?
  2                               9.91 (3.43)      --     .698       11.02 (3.56)    .712     .707        8.78 (2.91)    .533     .614
  3                              10.69 (3.86)    .731     .774       11.79 (4.03)    .772     .801        9.57 (3.37)    .629     .685
Summative Pre (Ecosystems)        6.69 (2.35)    .270     .146        6.81 (2.59)    .403     .347        6.57 (2.09)    .079    -.163
Summative Post (Ecosystems)      11.38 (4.05)    .775     .808       13.28 (3.71)    .764     .804        9.44 (3.46)    .661     .728

Learning Exercises (Geometry)
  1                               9.33 (3.69)    .719     .763       10.57 (3.93)    .767     .803        8.07 (2.97)    .548     .609
  2                               8.02 (2.44)    .391     .376        9.00 (2.49)    .409     .500        7.02 (1.94)    .083    -.122
  3                               9.44 (3.16)    .597     .560       10.62 (3.17)    .602     .564        8.24 (2.69)    .450     .377
Summative Pre (Geometry)          5.62 (2.?2)    .544     .544        6.66 (2.97)    .611     .586        4.57 (1.95)    .149     .191
Summative Post (Geometry)         9.46 (4.16)    .782     .765       11.66 (4.18)    .794     .740        7.22 (2.71)    .484     .540

NOTE:

Experimental = learning for mastery students
SD = standard deviation
KR-20 = Kuder-Richardson Formula 20
Split-Half = Spearman-Brown Prophecy Formula
-- or ? = entry or digit illegible in the source copy




TABLE 4

Internal Reliability Estimates for Four Simulated Data Sets (N = 100)

Data Set     Mean (SD)       Skewness   Kurtosis      κ      KR-20   Split-Half

   1         10.00 (4.05)        --       -.331     .328     .753       .777
   2         13.02 (2.97)      -.503      -.035     .126     .575       .644
   3         15.98 (1.97)      -.352      -.210     .083     .334       .425
   4         18.98 (1.08)     -1.162      1.255     .000     .337       .507

NOTE:

SD = standard deviation
κ = kappa coefficient (criterion for mastery = 80%)
KR-20 = Kuder-Richardson Formula 20
Split-Half = Spearman-Brown Prophecy Formula
-- = entry illegible in the source copy